chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile) by donriddo · Pull Request #2400 · tetherto/qvac

donriddo · 2026-06-02T16:13:53Z

🎯 What problem does this PR solve?

The WB team needs throughput numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and 2B across five quantizations (Q4_0, Q4_1, Q4_K_M, Q6_K, Q8_0) and reasoning budgets -1 and 0, on both desktop and mobile, with KV-cache types varied on mobile. It also needs a way to catch regressions between addon versions.

📝 How does it solve it?

Coverage

Models: Qwen3.5-0.8B + 2B (5 quants each), keep Qwen3-1.7B as a desktop comparison baseline, drop Qwen3-4B. No PyTorch.
Reasoning budget -1 and 0; single ~512-token prompt (verified at ~518 templated tokens against the Qwen3.5 tokenizer).
Mobile KV-cache types f16, q8_0, q4_0, plus TurboQuant/PolarQuant (tbq3_0/pq3_0, tbq4_0/pq4_0, pq3_0, pq4_0); desktop runs GPU, mobile runs both gpu and cpu.

Report — unified renderer (render-report.js), one identical table per device (desktop + 5 mobile):

Columns: TTFT (ms) · TPS · ppTPS · Tokens, each as mean ± stddev across repeats (desktop 5, mobile 3).
Header records addon version, prompt size, repeats, and the detected desktop GPU — version + GPU are stamped into the run's artifacts so they're accurate and survive a later re-render.
Crashed rows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU — run anyway, detected, reported).
Best configuration per device (highest TPS, highest ppTPS).
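
The "best configuration per device" selection above could be sketched roughly as follows. The row shape (`device`, `config`, `tps`, `ppTps`, `crashed`) and the helper name `bestPerDevice` are assumptions made for illustration, not taken from render-report.js, and the sample numbers are invented.

```javascript
// Hypothetical row shape; the real render-report.js may differ.
function bestPerDevice (rows, metric) {
  const best = new Map()
  for (const row of rows) {
    if (row.crashed) continue // Crashed rows have no throughput to rank
    const current = best.get(row.device)
    if (!current || row[metric] > current[metric]) best.set(row.device, row)
  }
  return best
}

// Invented sample rows for a single device.
const sampleRows = [
  { device: 'phone-a', config: 'Q4_0/f16/gpu', tps: 30, ppTps: 400, crashed: false },
  { device: 'phone-a', config: 'Q8_0/f16/cpu', tps: 22, ppTps: 650, crashed: false },
  { device: 'phone-a', config: 'Q4_0/q4_0/gpu', tps: null, ppTps: null, crashed: true }
]

console.log(bestPerDevice(sampleRows, 'tps').get('phone-a').config)
console.log(bestPerDevice(sampleRows, 'ppTps').get('phone-a').config)
```

Ranking each metric separately matters because the highest-TPS config and the highest-ppTPS config need not be the same row.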

Cross-run comparison (regression detection)

summarize_only re-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.
compare_run_id adds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.
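
A minimal sketch of how the Δ columns might be derived, assuming each row is keyed by its full configuration. The field names (`ttftMs`, `tps`, `ppTps`, `backend`, `budget`, and so on) and the helper `withDeltas` are hypothetical; the actual renderer may key and store rows differently.

```javascript
// Hypothetical config key: one row per (device, model, quant, kv, backend, budget).
const keyOf = (r) => [r.device, r.model, r.quant, r.kv, r.backend, r.budget].join('|')

function withDeltas (currentRows, baselineRows) {
  const baseline = new Map(baselineRows.map((r) => [keyOf(r), r]))
  return currentRows.map((row) => {
    const base = baseline.get(keyOf(row))
    if (!base) return { ...row, delta: null } // no matching baseline config
    return {
      ...row,
      delta: {
        ttftMs: row.ttftMs - base.ttftMs,
        tps: row.tps - base.tps,
        ppTps: row.ppTps - base.ppTps
      }
    }
  })
}

// Invented values for one configuration in two runs.
const cfg = { device: 'phone-a', model: '2B', quant: 'Q4_K_M', kv: 'f16', backend: 'gpu', budget: -1 }
const baselineRun = [{ ...cfg, ttftMs: 600, tps: 30, ppTps: 800 }]
const currentRun = [{ ...cfg, ttftMs: 650, tps: 28, ppTps: 780 }]

console.log(withDeltas(currentRun, baselineRun)[0].delta)
```

A positive Δ TTFT or a negative Δ TPS / ppTPS would indicate a slowdown relative to the baseline.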

Mobile execution

Sharded into one group per (model × KV-cache type), giving 70 shards (2 sizes × 5 quants × 7 KV-cache types). These run as 7 sequential KV-cache batches to fit the Device Farm per-test ceiling and avoid pool/disk exhaustion. 3 measured repetitions per config.
The 70 shard files and the workflow's test_groups are generated from one source of truth (test/integration/_benchmark-matrix.js) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.
Deliberately absent from test-groups.json; scheduled only via the workflow's test_groups override.
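
To make the shard arithmetic concrete, here is a rough sketch of building the 70-shard matrix from a single definition and checking a generated set for drift. The shard naming scheme and the `findDrift` helper are assumptions for illustration; the real source of truth is test/integration/_benchmark-matrix.js, whose contents are not shown in this PR description.

```javascript
const sizes = ['0.8B', '2B']
const quants = ['Q4_0', 'Q4_1', 'Q4_K_M', 'Q6_K', 'Q8_0']
// KV-cache type labels copied from the coverage list above, as written there.
const kvTypes = ['f16', 'q8_0', 'q4_0', 'tbq3_0/pq3_0', 'tbq4_0/pq4_0', 'pq3_0', 'pq4_0']

// One batch per KV-cache type; each batch holds one shard per (size, quant).
const batches = kvTypes.map((kv) =>
  sizes.flatMap((size) => quants.map((quant) => `${size}-${quant}-${kv}`))
)
const shards = batches.flat()

// Hypothetical drift check: compares expected shard names with generated ones.
function findDrift (expected, actual) {
  const have = new Set(actual)
  const want = new Set(expected)
  return {
    missing: expected.filter((name) => !have.has(name)),
    unexpected: actual.filter((name) => !want.has(name))
  }
}

console.log(shards.length, batches.length, batches[0].length)
console.log(findDrift(shards, shards.slice(1)))
```

A CI step built this way would fail whenever either `missing` or `unexpected` is non-empty.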

Workflow inputs (no per-run configurability of the matrix — it's fixed in the scripts):
ref, run_desktop, run_mobile, summarize_only, artifact_run_id, compare_run_id. The shared integration-mobile-test-llm-llamacpp.yml gains two additive optional inputs (job_timeout_minutes default 120, artifact_suffix default empty) — backward-compatible for other addon callers.

🧪 How was it tested?

npx standard clean; validate-mobile-tests.js in sync; verify:benchmark-shards confirms the matrix, the generated integration.auto.cjs (shard-file refs and run-function names), and the workflow test_groups are all in lockstep, so a generator change can't silently desync the Device Farm grep.
Generation pipeline verified locally: from a fresh checkout (shards absent) test:integration:generate regenerates everything with zero drift in the committed integration.auto.cjs; the mobile-only benchmark shards skip cleanly on desktop.
Validated end-to-end across every input combination with real runs (full, desktop-only, mobile-only, re-render, comparison).
- One full run — desktop + the complete 70-shard mobile matrix in a single pass: https://github.com/tetherto/qvac/actions/runs/27490240383
  - Desktop sweep on the self-hosted GPU (Desktop (NVIDIA RTX 4000 SFF Ada Generation), desktop=5).
  - Mobile 70-shard matrix — 2 sizes × 5 quants × 7 KV-cache types (incl. TurboQuant/PolarQuant), mobile=3 with mean ± stddev and best-config per device. iPhone 16/17 report the full 70/70; the combos a device's GPU doesn't support (Adreno quantized-KV, TurboQuant on Metal) surface as Crashed rows / coverage gaps, as intended.
- Per-batch wall-clock ran 54–114 min under the 180-min cap; the 7 KV-cache batches run sequentially.

💥 Known findings from the runs (data, not code issues)

Adreno GPU (Samsung S25/S26) crashes on all gpu + kv=q4_0 and gpu + kv=q8_0 — confirmed and reported as Crashed. CPU path handles quantized KV fine.
Mobile thermal throttling: on some mobile configs successive repeats get slower (e.g. ppTPS 850 -> 492 -> 428 across 3 reps), which widens the ± stddev on those rows. This is genuine sustained-load throttling on real devices, not measurement error — the stddev reflects it honestly.
Pixel 9 Pro GPU TTFT/ppTPS are notably weaker than the other devices across quants, consistent with a Vulkan/driver characteristic; CPU results are plausible.
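
As a worked example of how throttling widens the ± column, the three ppTPS repetitions quoted above (850, 492, 428) can be run through a mean/stddev helper. This sketch assumes the sample standard deviation (dividing by n − 1); the description does not say which estimator the report uses, and the population form would give a smaller spread.

```javascript
// Assumption: sample standard deviation (n - 1 in the denominator).
function meanStddev (values) {
  const n = values.length
  const mean = values.reduce((sum, v) => sum + v, 0) / n
  if (n < 2) return { mean, stddev: 0 }
  const sumSq = values.reduce((sum, v) => sum + (v - mean) ** 2, 0)
  return { mean, stddev: Math.sqrt(sumSq / (n - 1)) }
}

const formatCell = ({ mean, stddev }) => `${Math.round(mean)} ± ${Math.round(stddev)}`

console.log(formatCell(meanStddev([850, 492, 428]))) // 590 ± 227
```

Under that assumption the spread is well over a third of the mean, which is what a throttled row looks like next to a stable one.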

📦 Notes

Benchmark/test infrastructure only — no addon index.js/native or public-API change, so no version bump or CHANGELOG entry ([skiplog]).
Pairs with #2382 (workflow infra, already merged).

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1215364805976158

gianni-cor · 2026-06-02T17:33:11Z

just for mobile, can you run the bench on both CPU and GPU?

github-actions · 2026-06-02T17:33:39Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

donriddo · 2026-06-02T18:26:06Z

> just for mobile, can you run the bench on both CPU and GPU?

Already does. mobile.config.json sets "devices": ["gpu", "cpu"] and benchmark-perf.test.js loops over both, so each model and quant runs on CPU and GPU.

A comparison requested via compare_run_id renders delta columns against a baseline run. When the baseline produced no benchmark rows (e.g. only its run-meta/desktop-meta metadata artifacts were downloaded), the comparison was silently empty: the report rendered with no deltas and the job went green even though the requested comparison was never produced. render-report.js now exits non-zero when compareDir is set but the baseline has zero rows. This is distinct from a baseline that has rows but none matching the current devices, which still renders a per-device note.
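
The guard described above could look something like this sketch. `comparisonExitCode` and its argument shape are hypothetical; only the behaviour (non-zero exit when a comparison is requested but the baseline has zero rows) comes from the commit message.

```javascript
// Hypothetical helper: returns the exit code the renderer should finish with.
function comparisonExitCode ({ compareDir, baselineRows }) {
  if (compareDir && baselineRows.length === 0) {
    console.error(`comparison requested (${compareDir}) but the baseline has no benchmark rows`)
    return 1
  }
  // No comparison requested, or the baseline has rows (even if none match
  // the current devices, which is reported per device instead of failing).
  return 0
}

// In a CLI entry point this might be wired up as:
// process.exitCode = comparisonExitCode({ compareDir, baselineRows })
console.log(comparisonExitCode({ compareDir: 'baseline', baselineRows: [] }))
```

Returning the code rather than calling `process.exit` directly keeps the check easy to test in isolation.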

The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.

jesusmb1995

The only thing possibly missing is an .html report; that could be done in a follow-up.

The consolidated report is now over a thousand rows, which is hard to scan. render-report.js gains two visual outputs:
- A Charts section embedded in the markdown as Mermaid xychart bars, so a device throughput ranking and the KV-cache / quantization comparison for the fastest device render inline in the GitHub step summary.
- A --html output that writes a self-contained file (inline SVG, no deps or CDN) with the full per-device grouped charts and stddev error bars.

The summarize job emits both; the markdown points viewers to the HTML file (uploaded with the report artifacts) for the full per-device view.

Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both reasoning-budget values on it. The warm-up was inside that reasoning-budget loop, so every backend warmed up twice. But the warm-up only primes the GPU kernels/caches for the loaded model, which the reasoning budget (a per-call generation param) does not change, so the second warm-up was pure overhead (~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
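
The loop restructuring can be pictured with the sketch below, which only records the order of operations instead of calling a real model. `runShard` and the log strings are illustrative stand-ins, not code from benchmark-perf.test.js.

```javascript
// Stand-in for a shard run: records what would happen, in order.
function runShard ({ backends, budgets, reps }) {
  const log = []
  for (const backend of backends) {
    log.push(`load ${backend}`)
    log.push(`warmup ${backend}`) // once per backend, outside the budget loop
    for (const budget of budgets) {
      for (let rep = 1; rep <= reps; rep++) {
        log.push(`measure ${backend} budget=${budget} rep=${rep}`)
      }
    }
  }
  return log
}

const shardLog = runShard({ backends: ['gpu', 'cpu'], budgets: [-1, 0], reps: 3 })
console.log(shardLog.filter((line) => line.startsWith('warmup')).length)
console.log(shardLog.filter((line) => line.startsWith('measure')).length)
```

With the warm-up hoisted, a shard performs 2 warm-ups instead of 4 while still taking the same 12 measurements.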

The Stamp desktop device step interpolated the nvidia-smi GPU name directly into the printf inside its run block. Route it through a GPU_NAME env var so the value reaches the shell as data rather than as expanded workflow syntax, matching the env-mapping already used for the dispatch inputs elsewhere in this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform across every step.

donriddo · 2026-06-12T10:28:29Z

/review

The mobile chart helpers averaged a metric over every row sharing a (device, category) key, so a single bar blended both backends (gpu and cpu), both model sizes and both reasoning budgets — a value no real configuration produced — and its stddev whisker was the spread across those blended configs, not the measured 3-rep noise. Charts now hold every axis but the one on the x-axis at a fixed value (size 2B, reasoning budget -1, and the non-varied categorical at its default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization chart), so each bar is one measured configuration and its whisker is that config's own 3-rep stddev. gpu and cpu are charted separately and never blended, with a shared y-scale per metric. The inline mermaid is reduced to one device-ranking chart at a single stated config. Crashed configs remain missing bars rather than zeros, and the download note now names the real artifact (qwen35-benchmark-findings) and the file inside it.
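
One way to express "hold every axis but the x-axis fixed" is a filter like the one sketched here. The fixed values come from the commit message; the row fields and the `chartRows` helper are assumptions for illustration only.

```javascript
// Fixed values from the description: size 2B, budget -1, weights Q4_K_M, KV f16.
const FIXED = { size: '2B', budget: -1, quant: 'Q4_K_M', kv: 'f16' }

// Rows for one chart that varies `axis` ('kv' or 'quant') on one device and backend.
// Each returned row is a single measured configuration, so its own mean and
// stddev can be drawn directly as the bar and whisker.
function chartRows (rows, { device, backend, axis }) {
  return rows.filter((r) =>
    !r.crashed && // crashed configs stay missing rather than becoming zero bars
    r.device === device &&
    r.backend === backend &&
    Object.keys(FIXED).every((k) => k === axis || r[k] === FIXED[k])
  )
}

// Invented rows for illustration.
const base = { device: 'phone-a', size: '2B', budget: -1, quant: 'Q4_K_M', crashed: false }
const chartInput = [
  { ...base, backend: 'gpu', kv: 'f16', mean: 30, stddev: 1 },
  { ...base, backend: 'gpu', kv: 'q8_0', mean: null, stddev: null, crashed: true },
  { ...base, backend: 'cpu', kv: 'f16', mean: 18, stddev: 2 },
  { ...base, backend: 'gpu', kv: 'f16', size: '0.8B', mean: 55, stddev: 3 }
]

console.log(chartRows(chartInput, { device: 'phone-a', backend: 'gpu', axis: 'kv' }))
```

Because gpu and cpu are selected separately, no bar can blend the two backends.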

Coverage compared the reported shards against the renderer's CURRENT matrix, so re-rendering an older run after the matrix grew showed it as falsely incomplete: a complete 30-shard run read 30/70 against today's 70-shard matrix. The stamp-version job now records the run's expected shard list into run-meta.json alongside the addon version, and coverage scores against that stamped list when present, falling back to the live matrix only for runs that predate the stamp. A re-render of a stamped run is therefore always scored against the matrix it actually targeted, while genuinely missing shards are still surfaced.
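
A small sketch of scoring coverage against the stamped list, with a fallback to the live matrix. The field name `expectedShards` and the helper `coverage` are hypothetical; the description only says the expected shard list is recorded in run-meta.json.

```javascript
// Hypothetical run-meta field name: expectedShards.
function coverage (reportedShards, runMeta, liveMatrix) {
  const expected = (runMeta && runMeta.expectedShards) || liveMatrix
  const reported = new Set(reportedShards)
  const missing = expected.filter((name) => !reported.has(name))
  return { done: expected.length - missing.length, total: expected.length, missing }
}

// Invented shard names: today's matrix has 70, the older run targeted 30.
const liveMatrix = Array.from({ length: 70 }, (_, i) => `shard-${i}`)
const oldRunShards = liveMatrix.slice(0, 30)

const stamped = coverage(oldRunShards, { expectedShards: oldRunShards }, liveMatrix)
const unstamped = coverage(oldRunShards, null, liveMatrix)
console.log(`${stamped.done}/${stamped.total}`, `${unstamped.done}/${unstamped.total}`)
```

The stamped run scores 30/30 while the same shards scored against the live matrix read 30/70, which is the false "incomplete" result the change removes.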

The report's chart note told readers to open qwen35-benchmark-charts.html but gave no link, so they had to scroll to the run's artifacts section and download it by hand. The renderer now takes an optional --charts-url and, when given, renders the artifact mention as a markdown link. The summarize job uploads the report first so the artifact's download URL is known, then substitutes that URL into the note before writing the run summary (falling back to the run page URL if the upload yields none). Local renders pass no URL and keep the plain text, so there is never a dangling link.
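
The optional link behaviour might be sketched as below. `chartsNote`, the note wording, and the example URLs are made up for illustration; only the file name and the link-or-plain-text rule come from the commit message.

```javascript
// Renders the artifact mention as a link only when a URL is supplied.
function chartsNote (chartsUrl) {
  const name = 'qwen35-benchmark-charts.html'
  const mention = chartsUrl ? `[${name}](${chartsUrl})` : name
  return `Full per-device charts: ${mention}`
}

// In CI, prefer the artifact URL and fall back to the run page URL.
const pickChartsUrl = (artifactUrl, runUrl) => artifactUrl || runUrl

console.log(chartsNote(pickChartsUrl('', 'https://example.com/runs/1')))
console.log(chartsNote(undefined)) // local render: plain text, no dangling link
```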

…ebuilds The desktop sweep ran on the GitHub-hosted GPU runner and built the addon from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant for ephemeral runners — destructive on a shared persistent runner. Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration tests use, and download the linux-x64 binary the prebuild job already produces instead of compiling on the runner. This adds the Manual Workspace Cleanup self-hosted runners need and drops the source build, the destructive disk cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for desktop-only dispatches so the binary is available to download.

The summarize job fetched the report artifacts with actions/download-artifact, which verifies the artifact digest and was failing with `digest-mismatch` on otherwise-intact artifacts (the gh CLI downloads the same files without issue). Under continue-on-error that left the input directory silently empty, so the render step reported a misleading "no benchmark reports found" and exited. Switch the current-run and baseline downloads to `gh run download`, which pulls the artifacts by name prefix and run id without the digest check, and emits a warning rather than masking a real failure.

The summarize job downloads the report artifacts with `gh run download`, which calls the Actions artifacts API and needs actions:read. The job only granted contents:read, so the download returned nothing and the render step reported "no benchmark reports found". Add actions:read.

…h artifacts" This reverts commit 1bedeb5.

…mmarize" This reverts commit 1b7f6b3.

prebuild now needs verify-shards, so a benchmark-shard matrix drift fails the run in ~30s instead of after the expensive prebuild. !cancelled() + the result check keep a desktop-only run (where verify-shards is skipped) working. The verify-shards comment is corrected to match.

donriddo requested review from a team as code owners June 2, 2026 16:13

donriddo mentioned this pull request Jun 2, 2026

infra: add benchmark-perf-llm-llamacpp workflow #2382

Merged

donriddo temporarily deployed to release June 2, 2026 19:51 — with GitHub Actions Inactive

donriddo added 3 commits June 10, 2026 18:20

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

e6e6911

…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml

docs: correct the matrix dimensions in the generator comment

7463be1

The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.

maxim-smotrov previously approved these changes Jun 10, 2026

jesusmb1995 previously approved these changes Jun 11, 2026

donriddo added 3 commits June 11, 2026 11:12

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

8b28a7a

…-llm-suite

jesusmb1995 previously approved these changes Jun 11, 2026

jpgaribotti previously approved these changes Jun 11, 2026

maxim-smotrov previously approved these changes Jun 11, 2026

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

50a0bd2

…-llm-suite

donriddo added 12 commits June 12, 2026 12:47

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

5fd6400

…-llm-suite

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

72a6091

…-llm-suite

Revert "fix: grant summarize actions:read so gh run download can fetc…

61688a5

…h artifacts" This reverts commit 1bedeb5.

Revert "fix: download benchmark report artifacts via the gh CLI in su…

516baec

…mmarize" This reverts commit 1b7f6b3.

Merge remote-tracking branch 'upstream/main' into feat/benchmark-perf…

c71b080

…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml

Merge branch 'main' into feat/benchmark-perf-llm-suite

2040180

ishanvohra2 mentioned this pull request Jun 16, 2026

QVAC-20699 infra[skiplog]: add desktop + mobile benchmark workflows for transcription-parakeet #2621

Merged

donriddo mentioned this pull request Jun 16, 2026

infra: add embed benchmark workflow surface (desktop + mobile) #2604

Open

donriddo wants to merge 41 commits into tetherto:main from donriddo:feat/benchmark-perf-llm-suite

5 participants
